Abstract

The coupling of electronic medical records (EMR) with genetic data has created the potential for implementing reverse 
genetic approaches in humans, whereby the function of a gene is inferred from the shared pattern of morbidity among 
homozygotes of a genetic variant. We explored the feasibility of this approach to identify phenotypes associated with low 
frequency variants using Vanderbilt's EMR-based BioVU resource. We analyzed 1,658 low frequency non-synonymous SNPs 
(nsSNPs) with a minor allele frequency (MAF)<10% collected on 8,546 subjects. For each nsSNP, we identified diagnoses 
shared by at least 2 minor allele homozygotes and with an association p<0.05. The diagnoses were reviewed by a clinician 
to ascertain whether they may share a common mechanistic basis. While a number of biologically compelling clinical 
patterns of association were observed, the frequency of these associations was identical to that observed using
genotype-permuted data sets, indicating that the associations were likely due to chance. To refine our analysis, we then
restricted the analysis to 711 nsSNPs in genes with phenotypes in the On-line Mendelian Inheritance in Man (OMIM) or
knock-out mouse phenotype databases. An initial comparison of the EMR diagnoses to the known in vivo functions of the
gene identified 25 candidate nsSNPs, 19 of which had significant genotype-phenotype associations when tested using
matched controls. Twelve of the 19 nsSNP associations were confirmed by a detailed record review. Four of the 12
nsSNP-phenotype associations were successfully replicated in an independent data set: thrombosis (F5, rs6031),
seizures/convulsions (GPR98, rs13157270), macular degeneration (CNGB3, rs3735972), and GI bleeding (HGFAC, rs16844401). These
analyses demonstrate the feasibility and challenges of using reverse genetics approaches to identify novel gene-phenotype 
associations in human subjects using low frequency variants. As increasing amounts of rare variant data are generated from 
modern genotyping and sequence platforms, model organism data may be an important tool to enable discovery.

Introduction

Electronic medical record (EMR) systems store an increasing 
amount of clinical, laboratory and biometric data generated by 
health care systems. These data offer opportunities to explore risk 
factors for diseases, the inter-relationships among disease entities, 
and determinants of treatment response in large populations of 
individuals [1]. EMR data integrated with DNA repositories can 
also be utilized to identify genetic contributions to human disease 
risk and treatment response [2-7]. The spectrum of disease entities 
collected in EMRs has also enabled large-scale bioinformatics 
approaches such as Phenome-Wide Association Study (PheWAS), 
which searches in a disease-agnostic fashion for associations 
between common polymorphisms and hundreds of clinical 
diseases, identified using billing codes [8,9]. The success of 
PheWAS approaches for common variants suggests that similar 
EMR-based approaches may identify associations with low 
frequency or rare variants [4,10,11]. 



Experimental model systems such as mouse models have been 
successful in assigning functionality to genes through the use of 
reverse genetics approaches, which identify phenotypes associated 
with a known genetic lesion [12,13]. Structured data derived from 
mouse studies are increasingly available through large coordinated 
efforts such as the Knock-out Mouse Project (KOMP) [14] and the 
Mouse Phenome Database [15]. These data sources provide a rich 
resource for generating biologically-relevant clinical hypotheses 
based on observations of model organisms that can now be tested 
in a real life setting using large EMRs coupled with DNA 
repositories, such as the Vanderbilt BioVU resource [16]. 

Rare and low frequency single nucleotide polymorphisms 
(SNPs) are appealing candidates to explain much of the variation 
in human traits that cannot be accounted for by common 
polymorphisms [17]. However, associating rare variants to disease 
represents a considerable methodological challenge and remains 
an area of active research [18,19]. From an epidemiological 
standpoint, low frequency variants are of particular interest
because they can be associated with large effect sizes, enabling
genetic approaches to discovery [20-22].

The coupling of EMR data with rare variant genetic data has 
created the potential for implementing reverse genetics approaches 
in humans, whereby the function of a gene is inferred from the 
shared pattern of morbidity among homozygotes of a genetic 
variant [23]. We explored the feasibility of this idea using 1,658 
low frequency non-synonymous SNP (nsSNP) variants and clinical 
phenotypes derived from Vanderbilt's EMR-based BioVU resource
[16]. We found that, taken alone, phenotype association
data did not yield associations statistically different from chance.
To identify biologically-relevant genetic associations, we analyzed 
711 nsSNPs in genes with in vivo functional genetic data reported 
in the OMIM (On-line Mendelian Inheritance in Man) or the 
knock-out mouse phenotype databases, both of which catalog a 
partial spectrum of disease associated with loss-of-function 
mutations. This approach yielded 12 candidate genotype-phenotype
associations, four of which we replicated in an independent
data set. This approach suggests a potential for important biologic 
association discovery as platforms genotyping hundreds of 
thousands of rare nsSNPs are deployed across EMRs. 

Materials and Methods 

Ethics Statement 

All data for these analyses were extracted from the Vanderbilt
DNA Databank, BioVU, which accrues DNA samples extracted 
from leftover blood remaining from routine clinical testing. This 
resource has been approved as non-human subjects research by 
Vanderbilt's local Institutional Review Board and the federal
Office of Human Research Protections (OHRP), and has been 
described in detail previously [16,24]. Briefly, BioVU is linked to a 
de-identified Electronic Medical Record (EMR) system in which 
all personal identifiers have been removed, and subjects may elect 
to be removed from BioVU at any time. This study was also 
reviewed by the Vanderbilt Institutional Review Board and
determined to be non-human subjects research. 

Study population 

A total of 8,546 subjects who had previously been genotyped at 
Vanderbilt University Medical Center (VUMC) were used in the 
analysis. The subjects belonged to three cohorts identified from 
BioVU, a de-identified collection of DNA samples extracted from 
discarded blood and linked to de-identified EMRs [16]. Two 
cohorts were assembled as part of the Vanderbilt Genome
Electronic medical Records (VGER) project within the electronic
Medical Records and Genomics (eMERGE) network [2]. The first
VGER cohort (VGER-660) was composed predominantly of
EMR-defined white European ancestry subjects (n = 3,174), and
the second (VGER-1M) was composed predominantly of EMR-defined
black African American subjects (n = 1,558). These
cohorts were selected for genotyping using phenotype selection
algorithms that identified individuals with normal cardiac
conduction or type 2 diabetes (and their controls) [5,25]. Subjects
in the third cohort were selected from BioVU by an ongoing study
(Vanderbilt Electronic Systems for Pharmacogenomic Assessment;
VESPA) examining the genomics of drug response [26] (n = 3,940;
Table S1). The largest VESPA studies are examining antibiotic
responsiveness (n = 2,476 subjects) and transplant patients (n = 921
subjects). Race assignment was determined using STRUCTURE
[27]: European ancestry (EA) was defined as subjects with a >90%
probability of being in the CEU cluster, and African ancestry (AA)
was defined as subjects with a >90% probability of being in the YRI
cluster, using HapMap populations as references.
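
The ancestry rule described above can be sketched as a simple threshold on cluster membership probabilities. This is a minimal illustration only: it assumes the STRUCTURE output has already been reduced to two per-subject probabilities, and the function name, argument names and the "other" label are hypothetical rather than taken from the original analysis.

```python
def assign_ancestry(p_ceu: float, p_yri: float, threshold: float = 0.90) -> str:
    """Label a subject from cluster membership probabilities.

    p_ceu, p_yri: assumed probabilities of membership in the CEU and YRI
    reference clusters. The strict '>' mirrors the ">90%" wording in the text.
    """
    if p_ceu > threshold:
        return "EA"
    if p_yri > threshold:
        return "AA"
    # Hypothetical catch-all label for subjects meeting neither criterion.
    return "other"


if __name__ == "__main__":
    for probs in [(0.97, 0.01), (0.04, 0.95), (0.55, 0.40)]:
        print(probs, assign_ancestry(*probs))
```

Subjects who fall below the threshold for both clusters are kept in a separate group here, which is one plausible reading of the "other races" row in Table 1.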



SNP selection 

Genotype data were acquired on one of three genotyping
platforms: the Illumina Human660W-Quadv1_A genotyping
platform (VGER-660), the Illumina Human1M-Duo (VGER-1M),
or the Illumina Omni1-Quad (VESPA). Each dataset was
separately cleaned using the quality control pipeline developed by
the eMERGE Genomics Working Group [28]. This entailed
identifying gender mismatches, SNPs failing concordance
with HapMap, batch effects, and duplicate
and related individuals. After quality control analyses, the data sets
were merged. The merged data set contained genotype informa- 
tion on 1,545,817 SNPs present on one or more of the genotyping 
platforms. 

An overview of the SNP selection process is shown in Figure 1.
Non-synonymous SNPs (nsSNPs) that had a MAF less than 10% in
both EA and AA populations and had more than 10 minor allele
homozygotes were selected for analysis. nsSNPs with 10 or fewer
minor allele homozygotes were excluded to reduce statistical biases
associated with very small sample sizes. A total of 1,658 nsSNPs
met these initial inclusion criteria. The mean MAF was
5.3%±3.1% (SD) and 4.7%±3.2% (SD) for EAs and AAs,
respectively. The median number of subjects with genotype data
available for a given nsSNP was 4,750±2,097 (SD). Of the 1,658
nsSNPs initially identified, 440 were located in genes with disease
associations in the OMIM database and 555 were in the KO mouse
data set. In total, 711 nsSNPs were located in 591 genes found in
either the OMIM or the KO mouse data set and 284 nsSNPs were
in both.

Figure 1. Overview of the nsSNP selection process. There was no
difference in number of diagnoses significantly associated with the
1,658 nsSNPs when compared to genotype-permuted data. Hence, a
nsSNP selection strategy that compared diagnoses to those reported
in either OMIM or the KO mouse data was used. A multi-step selection
and review process identified 12 candidate nsSNPs.
[Flowchart. Start: 1,545,817 SNPs from 8,546 subjects; initial SNP selection, nsSNPs with a MAF<10% and >10 HZ (n=1,658). Permutation branch: 100 genotype-permuted data sets generated; comparison of the number of significantly associated diagnoses; no difference between real and permuted data. Candidate branch: SNPs in genes with OMIM/KO mouse data (n=711); preliminary candidate screen, SNPs with diagnosis codes similar to OMIM/KO phenotypes (n=25); statistical testing with matched controls, SNPs with p<0.05 (n=19); record review for phenotype confirmation (n=12 SNPs).]
doi:10.1371/journal.pone.0100322.g001
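
The initial nsSNP inclusion criteria (MAF below 10% in both EA and AA subjects, and more than 10 minor allele homozygotes) can be sketched as follows. This is an illustrative sketch, not the pipeline used in the study: the class and function names are hypothetical, and it assumes that the homozygote count is pooled across the two ancestry groups and that the same allele is the minor allele in both, neither of which is stated in the text.

```python
from dataclasses import dataclass


@dataclass
class GenotypeCounts:
    """Genotype counts for one nsSNP within one ancestry group (hypothetical type)."""
    common_hom: int  # homozygous for the common allele
    het: int         # heterozygous
    minor_hom: int   # homozygous for the minor allele

    def maf(self) -> float:
        """Minor allele frequency = minor alleles / total alleles."""
        n = self.common_hom + self.het + self.minor_hom
        return (self.het + 2 * self.minor_hom) / (2 * n) if n else 0.0


def passes_selection(ea: GenotypeCounts, aa: GenotypeCounts,
                     maf_max: float = 0.10, min_minor_hom: int = 10) -> bool:
    """Apply the MAF and homozygote-count criteria described in the text.

    Assumption: minor allele homozygotes are counted across both groups.
    """
    enough_homozygotes = (ea.minor_hom + aa.minor_hom) > min_minor_hom
    return ea.maf() < maf_max and aa.maf() < maf_max and enough_homozygotes


if __name__ == "__main__":
    ea = GenotypeCounts(common_hom=900, het=90, minor_hom=10)
    aa = GenotypeCounts(common_hom=450, het=48, minor_hom=2)
    print(ea.maf(), aa.maf(), passes_selection(ea, aa))
```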

Clinical data extraction 

Clinical diagnoses, symptoms and problems for each subject 
were extracted from the Vanderbilt University Medical Center 
(VUMC) Synthetic Derivative, a de-identified image of the 
Vanderbilt EMR [16]. Diagnoses were derived from ICD-9 and 
physician-maintained problem lists. Problem lists were manually 
reviewed to correct misspellings and expand abbreviations and 
diagnoses were then mapped to their corresponding ICD-9 code 
using text matching. There were 13 instances where a new clinical
code was created (e.g. AV nodal re-entry tachycardia) in order to 
capture the diagnosis with specificity (these codes can be found in 
Table S2). Cancer diagnoses were not included in these analyses as 
the molecular phenotypes described in the Mouse Phenotype 
database could not be easily mapped to a specific cancer type. 
After extraction and mapping of problem list entries, there were 
8,275 unique clinical codes. In this study, we did not use the 
predefined list of PheWAS phenotypes but created a new one, as 
doing so allowed the most appropriate mapping of diagnoses 
experienced in the individuals [8]. De novo creation of aggregations
based on those phenotypes in patients with rare nsSNPs
theoretically enhanced our sensitivity to detect potential unforeseen
aggregations with rare nsSNPs that may not be found in the a
priori PheWAS codes. The clinical codes were aggregated into 1,609 groups of
related codes (see Table S2 for ICD-9 groupings).

Identifying Candidate Associations 

In order to identify genotype-phenotype associations, we 
generated a list of all diagnoses present in two or more of the
homozygotes for the minor allele for each nsSNP. Any problem
that appeared on more than 5% of these lists across all nsSNPs was
excluded, as this was typically caused by rarely used diagnosis 
codes for which just 1 or 2 cases present among the minor allele 
homozygotes would give a strong association p-value. For each 
common diagnosis, a two-sided Fisher's exact test was used to 
compare the proportions of affected minor allele homozygotes to 
affected common allele homozygotes. The heterozygotes were not
used in the analysis to simplify the analysis and prevent a loss of 
power associated with model misclassification if the wrong 
association model was chosen (e.g., additive instead of recessive 
or dominant). A composite list of all diagnoses with an a priori 
Fisher's exact p-value less than 0.05 was then generated for each 
nsSNP. 
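
The comparison described above can be illustrated with a small, dependency-free implementation of the two-sided Fisher's exact test on a 2x2 table. This is a sketch under stated assumptions, not the software used in the study (the analyses were run in SAS): the function name and the table layout are hypothetical, and the two-sided p-value is taken as the sum of the probabilities of all tables no more likely than the observed one, which is one common definition.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout:
      row 1 = minor allele homozygotes  (a affected, b unaffected)
      row 2 = common allele homozygotes (c affected, d unaffected)
    """
    row1, row2 = a + b, c + d
    col1 = a + c  # total affected subjects
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x affected subjects in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum over all tables that are no more likely than the observed table;
    # the small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts: 2 of 12 minor allele homozygotes and
    # 40 of 4,000 common allele homozygotes carry a given diagnosis.
    print(fisher_exact_two_sided(2, 10, 40, 3960))
```

With only a handful of minor allele homozygotes per nsSNP, one or two affected subjects can be enough to move the p-value below 0.05, which is consistent with the authors' decision to require a diagnosis to be shared by at least two homozygotes.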

To estimate the number of significant nsSNP-phenotype 
associations expected by chance, permutation testing was 
employed. We generated 100 randomized data sets by taking 
the 1,658 nsSNPs and permuting the link between the genotypes 
and phenotypes (i.e. the genotype values for a nsSNP were 
randomly redistributed across all patients while keeping their 
phenotypes intact). We then compared the number of nsSNPs having
diagnoses with a Bonferroni-corrected Fisher's exact p-value<0.05 
using the actual genotype data to the numbers of significant 
diagnoses associated with each of 100 randomized data sets. We 
also compared the average number of diagnoses associated with an 
nsSNP with a p<0.05. 
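
The permutation procedure can be sketched for a single nsSNP as below: the genotype vector is shuffled across subjects while each subject's diagnoses stay fixed, and the number of nominally significant diagnoses is recounted for each shuffled data set. This is a minimal sketch with hypothetical names and data structures (genotypes coded as minor allele counts 0/1/2, diagnoses as sets of subject indices); it uses an unadjusted threshold for simplicity, whereas the study also compared Bonferroni-corrected results.

```python
import random
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    support = range(max(0, c1 - r2), min(r1, c1) + 1)
    return min(1.0, sum(p for p in map(prob, support) if p <= p_obs * (1 + 1e-9)))


def significant_diagnoses(genotypes, diagnoses, alpha=0.05, min_shared=2):
    """Diagnoses shared by >= min_shared minor allele homozygotes with p < alpha.

    genotypes: list of minor allele counts (0, 1, 2), one per subject.
    diagnoses: dict mapping a diagnosis name to the set of affected subject indices.
    Heterozygotes (genotype 1) are ignored, as in the text.
    """
    minor = {i for i, g in enumerate(genotypes) if g == 2}
    common = {i for i, g in enumerate(genotypes) if g == 0}
    hits = []
    for name, affected in diagnoses.items():
        a = len(affected & minor)
        if a < min_shared:
            continue
        c = len(affected & common)
        if fisher_two_sided(a, len(minor) - a, c, len(common) - c) < alpha:
            hits.append(name)
    return hits


def permuted_hit_counts(genotypes, diagnoses, n_perm=100, seed=0):
    """Count significant diagnoses in n_perm genotype-permuted copies of the data."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_perm):
        shuffled = list(genotypes)
        rng.shuffle(shuffled)  # break the genotype-phenotype link
        counts.append(len(significant_diagnoses(shuffled, diagnoses)))
    return counts


if __name__ == "__main__":
    genotypes = [2] * 10 + [1] * 5 + [0] * 90
    diagnoses = {"X": set(range(8)) | {20, 21}, "Y": {0, 50}}
    print(significant_diagnoses(genotypes, diagnoses))
    print(permuted_hit_counts(genotypes, diagnoses, n_perm=10, seed=1))
```

Comparing the count from the real genotypes with the distribution of counts from the permuted sets gives the kind of real-versus-chance comparison described in the text.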

Based on the permutation analyses, we found that real and 
randomized genetic data could not be distinguished on the basis of 
statistical outliers. Hence, we restricted all subsequent analyses to 
the 711 nsSNPs in the OMIM or the KO mouse data sets. These
711 nsSNPs first underwent a human review comparing the
phenotypes from the KO mouse and OMIM databases to the 
composite diagnosis list from the minor allele homozygotes. 



nsSNPs that were associated with diseases arising from a 
pathophysiological mechanism and organ system distribution that 
was comparable to the known function of the gene were selected 
for further review. nsSNPs were also included if the disease 
mechanism in the homozygotes appeared to be opposite of that 
described, as this could occur if an nsSNP was associated with a 
gain-of-function mutation. For 686 of the 711 nsSNPs, the 
candidate associations were deemed inconsistent with the KO 
mouse and OMIM databases. In all, 25 nsSNPs were selected for 
further evaluation. 

Association testing using matched controls and EMR 
validation 

To more rigorously test each of the 25 nsSNPs identified above, 
we developed a clinical phenotype definition using composites of 
diagnosis codes that best approximated the phenotype descriptions 
in the OMIM and KO mouse databases (Table S3). For example, 
the PTAFR gene is associated with infection susceptibility including 
streptococcal infections [29,30]. Hence, phenotypes comprised of 
ICD-9 codes for streptococcal-associated diseases including 
respiratory infections, streptococcal infections, sepsis, sinusitis
and meningitis were defined. A significant association with at
least one of these phenotypes was required in order for the nsSNP
to be considered to be associated with the phenotype. In instances
where numerous clinical presentations were possible
based on the phenotype description of the KO mouse, the phenotype
was defined to incorporate the diagnoses observed during the
initial nsSNP review. For example, CLEC1B was associated with
abnormal blood vessel morphology in mice. The initial case review 
identified elevated rates of intracranial hemorrhage among the 
minor allele homozygotes for an nsSNP in this gene. Hence, this 
phenotype was specifically evaluated. 

Univariate exact logistic regression comparing minor allele 
homozygotes to common allele homozygotes was used to test 
associations. The common allele homozygotes were a random 
sample individually matched to the minor allele homozygotes on 
age strata (0-4 years, 5-19 years, 20-44 years, 45-60 years and 60+
years), gender, race and data set. Binomial power calculations
assuming P(disease in cases) = 30%, P(disease in controls) = 10%, 
number of cases = 25, alpha = 0.05 and beta = 0.80 showed that 
800 controls were needed per nsSNP. Depending upon the 
availability of matched controls, between 800 and 1,800 matched 
controls were selected per nsSNP. All nsSNPs that failed to show a
statistical association (defined as a p-value less than 0.05) with at 
least one phenotype were not considered for further review. Of the 
25 nsSNPs, 19 had significant associations. 
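
The matching step described above can be sketched as stratified random sampling: each minor allele homozygote is assigned a stratum from age band, gender, race and data set, and controls are drawn at random from common allele homozygotes in the same stratum. This is an illustrative sketch only. The record layout, function names and the number of controls per case are hypothetical, and because the published age bands overlap at 60 years, the code assumes that age 60 falls in the 45-60 band.

```python
import random
from collections import defaultdict


def age_stratum(age: float) -> str:
    """Map an age in years to the age bands listed in the text."""
    if age < 5:
        return "0-4"
    if age < 20:
        return "5-19"
    if age < 45:
        return "20-44"
    if age <= 60:  # assumption: age 60 belongs to the 45-60 band
        return "45-60"
    return "60+"


def stratum_key(subject: dict) -> tuple:
    """Matching key: age band, gender, race and source data set."""
    return (age_stratum(subject["age"]), subject["gender"],
            subject["race"], subject["dataset"])


def match_controls(cases, pool, per_case, seed=0):
    """Randomly draw up to per_case controls for each case from its stratum.

    cases: minor allele homozygotes; pool: common allele homozygotes.
    Each control is used at most once. Returns {case id: [control records]}.
    """
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for subject in pool:
        by_key[stratum_key(subject)].append(subject)
    for members in by_key.values():
        rng.shuffle(members)  # random order within each stratum
    matched = {}
    for case in cases:
        candidates = by_key[stratum_key(case)]
        take = min(per_case, len(candidates))
        matched[case["id"]] = [candidates.pop() for _ in range(take)]
    return matched


if __name__ == "__main__":
    cases = [{"id": "c1", "age": 50, "gender": "F", "race": "EA", "dataset": "VESPA"}]
    pool = [
        {"id": "p1", "age": 46, "gender": "F", "race": "EA", "dataset": "VESPA"},
        {"id": "p2", "age": 55, "gender": "F", "race": "EA", "dataset": "VESPA"},
        {"id": "p3", "age": 30, "gender": "F", "race": "EA", "dataset": "VESPA"},
    ]
    print(match_controls(cases, pool, per_case=2))
```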

After statistical testing, the electronic records of the minor allele 
homozygotes for the 19 nsSNPs were reviewed by a clinician to 
confirm that their clinical records supported their diagnoses 
inferred from the ICD-9 codes and problem lists. This review was 
used to ascertain whether any conditions comprising one of the 
phenotype definitions may have been previously ruled out or may 
have a known etiology that would preclude an underlying genetic
explanation. For instance, an ICD-9 code for joint pain in a 
patient for which a clinical record review indicated that the patient 
had an ankle fracture would not be considered a possible 
manifestation of gout. If clinical record review indicated that the 
ICD-9 codes did not support a diagnosis related to the function of 
the nsSNP, the nsSNP was excluded from further analysis, 
resulting in exclusion of 7 of the 19 SNPs. 

Replication analyses 

Of the 12 candidate nsSNPs, 10 were available in an EMR-derived
replication cohort that underwent genotyping using the
Illumina Infinium Exome BeadChip. The replication set contained
19,599 EAs and 1,993 AAs over 30 years of age
who were genotyped as part of a broad-based genotyping initiative
at Vanderbilt. Quality control procedures for the Exome chip data
have been previously described [31]. In brief, quality control was
performed by VANGARD (Vanderbilt Technologies for Advanced
Genomics Analysis and Research Design) and samples
were analyzed in conjunction with over 32,000 other BeadChip
samples. After clustering, samples were then evaluated for
heterozygous consistency rate between duplicated samples and
HapMap samples, gender mismatches, Mendelian errors, duplicate
identification and exclusion of subjects more closely related
than half-siblings. Data were filtered for a sample and genotype
call rate>99% and for deviation from Hardy-Weinberg equilibrium
(p>0.001). Phenotype data were based strictly on ICD-9 codes, with
cases defined as subjects with 1 or more codes and controls defined
as those subjects with no related codes. Only those phenotypes
with an association p-value<0.05 in the original analyses and with
>50 cases in the replication set were evaluated. EAs and AAs were
analyzed separately using an additive multivariable genetic model
adjusting for age, gender and 3 principal components. A
replication p-value<0.05 was considered statistically significant.

Table 1. Population characteristics.

Total subjects (n)                                    8645
No. males (%)                                         4079 (47.2)
No. females (%)                                       4566 (52.8)
No. European Ancestry (%)                             6002 (69.4)
No. African American (%)                              1734 (20.1)
No. other races (%)                                   909 (10.5)
Mean (std) age of last available diagnosis (years)    52 (18)
Mean (std) duration of EMR follow-up (years)          7 (5)

doi:10.1371/journal.pone.0100322.t001

Data analysis and external data sources 

All quality control analyses of nsSNP genotyping data were 
performed using PLINK v1.07 [32]. Principal components were fit
using EIGENSTRAT [33]. All post quality-control statistical
analyses were performed using SAS v9.3 (SAS Institute, Cary,
NC). Gene-disease associations were downloaded from OMIM
(http://omim.org/). Phenotype information for knock-out (KO)
mouse models was downloaded from the mouse genome
informatics resource (http://www.informatics.jax.org). These data
sources were current as of 6/24/2012. 

Results 

Permutation analyses 

8,546 subjects who had previously undergone SNP genotyping 
were used in this study (Table 1). Approximately 70% of the study 
population was EAs. The mean age of their most recent clinical 
encounter was 52 years and an average of 7 years of clinical data 
was available for each subject. Two approaches to identifying
candidate genotype-phenotype associations were used, as outlined
in Figure 1. For the first approach, a preliminary review of
phenotypes that were associated with the minor allele homozygotes
for the 1,658 low MAF nsSNPs identified a number of
compelling patterns of disease associations. For instance, the
associations with the nsSNP (rs33947968) in the MYO3A gene
encompassed a clinical disease spectrum that would suggest that
this nsSNP contributes to cardiopulmonary disease (Table S4).
However, similarly compelling phenotypic patterns were seen in 
reviews of associations derived from genotype-randomized data, 
suggesting that these associations were likely due to chance. 
Consistent with this notion, the number of the 1,658 nsSNPs with
clinical associations with a Bonferroni-adjusted p<0.05 was
similar between the real (n = 188 nsSNPs) and 100 genotype-randomized
data sets (median n = 194, inter-quartile range = 184-204),
as was the average number of diagnoses associated with a
nsSNP with an unadjusted p<0.05 (n = 19.2 for real data vs. a
median of 20.0 [IQR 19.8-20.1] for permuted sets). In addition, a
role for MYO3A in cardiopulmonary disease is not consistent with its
known biology, as expression of this gene is restricted to the ear 
and known mutations cause deafness [34]. Based on these results, 
we concluded that a completely agnostic approach to candidate 
nsSNP identification would result in a very high likelihood of 
biologically-implausible, false positive associations. 

SNP-phenotype associations using KO mouse and OMIM 
data 

In order to identify biologically-plausible gene-phenotype 
associations, we restricted subsequent analyses to 711 of the 
1,658 nsSNPs located in genes with functions described in the 
OMIM or KO mouse data sets. Of these 711 nsSNPs, the minor
allele homozygotes for 25 had diagnosis codes (with an association
p<0.05) consistent with the known function of the gene containing
the SNP. Six of these nsSNP-phenotype clusters were excluded
because the genotypes were not significantly associated (p>0.05)
with disease in analyses using matched controls. The medical
records for each minor allele homozygote for the remaining 19
nsSNPs were reviewed to confirm that their clinical data supported
their coded data. Seven of the 19 nsSNPs were excluded after this
review because the clinical records suggested a disease etiology
that was not consistent with the known physiology of the gene. For
example, while there was a statistically significant increase in chest
pain among homozygotes for an nsSNP in DNAH5, a gene
associated with respiratory ciliary disorders and bronchiectasis, the 
chest pain was generally attributed to external/traumatic causes 
rather than intrinsic lung disease. (See Table S5 and Table S6 for 
details of the 13 nsSNPs excluded in these steps). 

Of the twelve nsSNPs that advanced through all steps of the
selection process, the mean MAF was 6.3% and 5.0% in EAs and
AAs, respectively, and the mean number of homozygotes for each
nsSNP was 36 (Table 2). Two nsSNPs (ERCC4 and PLCG2) were
predicted to be damaging by PolyPhen-2 [35] analysis and one
encoded a nonsense mutation (TAAR1). The phenotypes for 1 and
5 of the 12 nsSNPs were described only in the OMIM or KO 
mouse databases, respectively, and the other 6 were described in 
both databases (Table S3). Results of association testing with 
matched controls are shown in Table 3 and the problem lists for 
these SNPs are shown in Table S7. 

Replication analyses 

The significant associations for 10 of the 12 nsSNPs were 
evaluated using an additive genetic model in an independent data 
set. Replicated associations were observed for 4 of the 10 genes 
(Table 4): CNGB3 (macular degeneration in EAs, OR = 1.2 [1.0-1.4],
p = 0.03), F5 (stroke in AAs, OR = 1.4 [1.0-1.9], p = 0.04),
GPR98 (convulsions in AAs, OR = 1.9 [1.1-3.3], p = 0.02) and
HGFAC (GI bleeding in EAs, OR = 1.2 [1.0-1.4], p = 0.02). The
association of GI bleeding with HGFAC [36] in humans has not
been described.

Discussion

In the present study, we evaluated the feasibility of identifying
gene-phenotype associations using low MAF nsSNPs in conjunction
with data extracted from the VUMC BioVU resource, an
integrated collection of genotype and EMR data. We found that
an agnostic approach based strictly on statistical outliers identified
a number of nsSNPs with clinically interesting patterns of disease
associations, but permutation analyses suggested that these
associations were likely due to chance. To circumvent this
problem, we used in vivo functional genomic data to identify
clinically-relevant candidate gene-phenotype associations. Our
approach incorporated a clinical/biological review process that
identified biologically plausible candidate phenotypes associated
with 12 nsSNPs. Of the 10 candidate nsSNPs evaluated in
replication analyses, 4 nsSNPs had significant associations: CNGB3
(macular degeneration in EAs), F5 (stroke in AAs), GPR98
(convulsions in AAs) and HGFAC (GI bleeding in EAs).

We restricted our analyses to minor allele homozygotes, as these 
subjects would be expected to manifest the deleterious effects of a 
nsSNP variant if the mode of genetic action is either additive or 
recessive [37]. We tested the hypothesis that a review of clinical 
codes shared among individuals homozygous for a nsSNP by an 
expert clinician would identify clinical disease patterns that would 
suggest a common predisposing genetic lesion. When the clinical 
review was conducted without a priori knowledge of the function of 
the gene, we observed that there were a number of false positive 
leads, which were due to the fact that a number of clinical codes 
often co-occur within a patient and, thus, can create a 
constellation of associations that would suggest that the homozy- 
gous carriers had a functional genetic lesion. For instance, patients 
with a cardiac valvular disorder may also have a number of 
specific and non-specific related cardiac codes such as "Cardiac 
complications", "Heart failure" and "Cardiac dysrhythmias". 
Hence, these codes may cluster, giving the impression that the nsSNP is
associated with a heavy burden of cardiac disease. To mitigate
these false positive associations, the clinical review was conducted 
with knowledge about the in vivo function of the gene, as reported 
in the OMIM or KO mouse data sources. While one strength of 
this approach was the identification of candidate nsSNPs with 
strong biological plausibility, using the data described in the KO 
mouse and OMIM resources presented challenges as many 
cataloged mutations cause complete loss-of-function associated 
with extreme, multi-organ phenotypes that are not easily 
translated into plausible clinical manifestations. Furthermore, in 
KO mice, many of the mutations were associated with embryonic 
lethality, or the phenotypic characterization was restricted to early 
embryonic anomalies [38]. Many phenotypes were also charac- 
terized at the molecular or cellular level, which posed similar 
translational challenges. The EMR data was also restricted to 
binary disease data, which prevented us from analyzing previous- 
ly-reported quantitative phenotypes (e.g. LDL levels) known to be 
affected by some of these genes. While we observed some instances
where the homozygotes had a set of coded symptoms that might 
be expected based on the function of the gene, a further review of 
the clinical records demonstrated that these symptoms were 
attributable to causes unrelated to the function of the gene. 
Overall, these challenges severely limited the utility of this general
approach. Indeed, only four replicable associations were identified 
among 711 nsSNPs evaluated, resulting in only a 0.5% success 
rate. 

The clinical review was also used in an effort to detect genetic 
pleiotropy. In particular, we were interested in identifying nsSNP 
variants that perturb broad underlying physiological mechanisms.
Such variants would be expected to distribute their effects across a
broad clinical spectrum, resulting in multiple weak statistical 
associations with a number of mechanistically related phenotypes. 
Hence, our lists of diagnoses evaluated included those that 
occurred at rates modestly higher than would be expected by 
chance (i.e. those with p<0.05) in order to increase our sensitivity 
for detecting pleiotropy. An example of a pleiotropic nsSNP that 
we identified was in the F5 gene which encodes a clotting factor 
known to be associated with thrombosis [39,40] and was 
associated with modestly elevated rates of spontaneous abortions, 
DVTs and strokes. Interestingly, the nsSNP in F5 that we observed 
(rs6031) is not the well-characterized F5 Leiden mutation found 
among EA subjects. This nsSNP (rs6031) was predominantly 
found in AAs, none of whom carried the F5 Leiden mutation.

We selected nsSNPs with MAFs below 10% in both EA and AA
subjects. We hypothesized that nsSNPs maintained at low frequencies 
across both ancestries were more likely to be located within regions 
under negative evolutionary selection pressure and could be 
associated with relatively strong genotype/phenotype associations. 
Our data, however, are not consistent with this hypothesis, as the 
replicable associations that we observed, such as convulsions and an
nsSNP in GPR98, were typically seen within a single racial group. 
Our hypothesis would have suggested that the associations would be 
persistent across races. Hence, it is more likely that the SNPs had low 
frequencies across races due to factors other than selection pressure. 
As an alternative approach to SNP selection, we could have selected 
SNPs which were predicted to be damaging using predictive software 
[41,42], which may have given a higher proportion of significant and
replicable SNP associations.

Of the four genes that we identified that had replicable 
phenotypic associations, three, including F5 described above, have 
been previously reported. Variants in CNGB3 have been associated 
with achromatopsia and juvenile macular degeneration [43-45].
GPR98 has been associated with febrile seizures in humans and
knock-out mice develop audiogenic seizures [46-48]. HGFAC
(hepatocyte growth factor activator) encodes a proteolytic enzyme 
that cleaves and activates hepatocyte growth factor [49]. Mice 
deficient in this gene demonstrate a decreased capacity to repair 
injured intestinal epithelium. We observed that an nsSNP
variant in this gene was associated with a clinical code for GI 
bleeding, suggesting that this variant may be impairing endothelial 
repair mechanisms. 

A benefit of using EMR-derived data for this type of genetic 
analysis is that the study population may either carry a high risk 
genetic background or have experienced environmental challenges 
that allow a phenotype to be expressed. For instance, the HGFAC 
knock-out mouse did not have an observable GI endothelial 
phenotype until challenged with a caustic agent [36]. Similarly, 
patients may seek healthcare at a tertiary care center such as 
VUMC because they had the requisite exposures to unmask the 
phenotype. Hence, an EMR-based study population may be 
enriched in extreme phenotypes. 

While EMR data is a rich resource for hypothesis generation 
and testing, there are challenges to its use in this type of analysis. 
As compared to targeted epidemiological studies or clinical trials, 
phenotypes entered into the EMR are often not concisely defined 
and the degree and extent of clinical ascertainment are variably 
affected by the reason a patient is seeking clinical care. For 
instance, a patient whose only records available are those from a 
particular clinical specialty may have limited information pertain- 
ing to diseases outside of that specialty. The direction of this bias 
would tend to underestimate prevalence rates. This bias is 
compounded by the fact that not all of the data captured in an 
EMR is amenable to extraction using coded data, and some may 
require more advanced methods, such as natural language 
processing [50], which often require modifications to solve 
particular problems. For instance, a record review of the F5 
mutation homozygotes revealed that 5 of the 13 (38%) women had 
a history of spontaneous abortions. Only 3 of 13 were 
identified using ICD-9 codes and problem lists. It is also difficult to 
gauge the clinical severity of a problem strictly from easily 
extractable coded data. This limitation tends to lead to 
non-differential misclassification and attenuates statistical associations. 
The data sets that we analyzed were not expressly curated for the 
phenotypes that were evaluated. Hence, the differential disease 
compositions of the data sets could account for our low replication 
rates. For instance, there are human and mouse data supporting a 
role for PTAFR gene variants in susceptibility to invasive 
streptococcal infections [29,30]. While an nsSNP in this gene was 
associated with infections consistent with streptococcus in our initial 
data evaluation, these associations were not replicated. This could 
be due to a different pattern of infections between the data sets. 
Alternatively, the initial analysis was based on a comparison of 
homozygotes, and thereby did not assume a specific mode of genetic 
inheritance. While this association was not replicated using an 
additive model, when we used a recessive genetic model, we found 
that the PTAFR variant was associated with acute sinusitis infection 
and upper respiratory infections (data not shown), suggesting that it 
may be acting through a recessive mode of action. 
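
The distinction between the additive and recessive models mentioned above comes down to how each subject's genotype is coded as a predictor. The following is a minimal illustrative sketch, not the analysis code used in this study; the function names and the example genotype list are hypothetical.

```python
# Illustrative genotype codings. All names here are hypothetical and are
# not taken from the study's analysis pipeline.

def additive_code(minor_allele_count: int) -> int:
    """Additive model: the predictor is the number of minor alleles (0, 1 or 2)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    return minor_allele_count


def recessive_code(minor_allele_count: int) -> int:
    """Recessive model: only minor allele homozygotes are coded as exposed."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor allele count must be 0, 1 or 2")
    return 1 if minor_allele_count == 2 else 0


# Made-up minor allele counts for six subjects.
genotypes = [0, 1, 2, 1, 0, 2]
additive = [additive_code(g) for g in genotypes]
recessive = [recessive_code(g) for g in genotypes]
```

Under the recessive coding, heterozygotes are grouped with major allele homozygotes, which is closer to the homozygote-based comparison used in the initial screen than the additive coding is.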

A final limitation of this study was the relatively small sample 
size of the study population, which limited the power to detect 
associations, especially when evaluating low frequency variants. 
This limitation was likely an important reason as to why a purely 
statistical approach to identifying genotype-phenotype associations 
did not perform better than chance. Hence, a larger sample size 
would likely have allowed us to identify a reduced set of genotype- 
phenotype associations using only statistical criteria. This has been 
true of the pheWAS approach in general, which has shown 
robust phenotype replication and discovery when studies are 
adequately powered [9,11]. 
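
To give a rough sense of why sample size is limiting for a homozygote-based approach, the expected number of minor allele homozygotes can be approximated as N × MAF². The sketch below assumes Hardy-Weinberg proportions in a single homogeneous population, which is a simplifying assumption and not something tested here; the function name is hypothetical, and the cohort size of 8,546 is the number of genotyped subjects reported in this study.

```python
# Back-of-the-envelope sketch, assuming Hardy-Weinberg proportions.
# The function name is hypothetical; real counts will differ, particularly
# in a cohort that pools subjects of different ancestries.

def expected_minor_homozygotes(n_subjects: int, maf: float) -> float:
    """Expected count of minor allele homozygotes: N * MAF^2."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie between 0 and 0.5")
    return n_subjects * maf ** 2


N_SUBJECTS = 8546  # genotyped subjects in this study

for maf in (0.10, 0.05, 0.01):
    count = expected_minor_homozygotes(N_SUBJECTS, maf)
    print(f"MAF {maf:.2f}: about {count:.1f} expected homozygotes")
```

Under these assumptions, a variant at the 10% MAF ceiling would be expected to yield roughly 85 homozygotes, whereas a variant with a MAF of 1% would be expected to yield fewer than one.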

In summary, we explored an intensive, clinically-oriented 
approach to identify biologically-plausible gene-phenotype associ- 
ations using an EMR linked to genetic data. As EMR data 
resources mature and genotyping data continues to become 
increasingly available, approaches such as ours may facilitate the 
identification of the specific genetic underpinnings of numerous 
clinical conditions. Our analyses also demonstrate the large 
potential for identifying compelling, but likely spurious associa- 
tions that arise when working with high-dimensional, correlated 
phenotypic data sets. Hence, future approaches that integrate 
biological data into the discovery process will be critical to identify 
valid and clinically meaningful gene-disease associations. 